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Abstract 

We propose a doubly stochastic primal-dual coordinate optimization algorithm for empirical 
risk minimization, which can be formulated as a bilinear saddle-point problem. In each iteration, 
our method randomly samples a block of coordinates of the primal and dual solutions to update. 
The linear convergence of our method could be established in terms of 1) the distance from the 
current iterate to the optimal solution and 2) the primal-dual objective gap. We show that 
the proposed method has a lower overall complexity than existing coordinate methods when 
either the data matrix has a factorized structure or the proximal mapping on each block is 
computationally expensive, e.g., involving an eigenvalue decomposition. The efficiency of the 
proposed method is confirmed by empirical studies on several real applications, such as the 
multi-task large margin nearest neighbor problem. 


1 Introduction 

We consider regularized empirical risk minimization (ERM) problems of the following form: 

min I P(x) = 1 V (j)i(afx) + g(x) 1 , (1) 

xeRp I n y-j' I 

where a \,..., a n el p are n data points with p features, <f>i : M —> M is a convex loss function of 
the linear predictor ajx , for i = 1,... ,n, and g : W —> M is a convex regularization function for 
the coefficient vector x 6 in the linear predictor. We assume g has a decomposable structure, 
namely, 


p 

g{ x ) = ^2gj(xj), 

3 =i 


( 2 ) 


1 



where gj : M —> M is only a function of Xj, the j- th coordinate of x. For simplicity, we consider 
a univariate gj at this moment. In Section [5j the proposed method will be generalized for the 
problems having a block-wise decomposable structure with multivariate gj. We further make the 
following assumptions: 

Assumption 1. For any a, (3 £ M, 

• gj is X-strongly convex for j = 1,2,... ,p, i.e., gj(a) > gj{/3) + g'j(/3)(a - /3) + §(a - /3) 2 ; 

• (pi is (I /7 )-smooth for i = 1,2,..., n, i.e., cpi(a) < (pi(/3) + V(pi(/3)(a - /3) + ^(a - /3) 2 . 

The problem ([Tj) captures many applications in business analytics, statistics, machine learning 
and data mining, and has triggered many studies in the optimization community. Typically, for 
each data point a*, there is an associated response value bi £ M, which can be continuous (in 
regression problems) or discrete (in classification problems). The examples of loss function (pi{-) 
associated to (aj, 6 j) include: 

• Square Loss, where a* £ ML, bi £ M and (pi(z) = ^(z — bi) 2 , which corresponds to linear 
regression problem; 

• Sigmoid Loss, where a* £ M p , bi £ {1, —1} and (pi(z) = log(l + exp(— biz)), which corresponds 
to logistic regression problem; 

• Smooth Hinge Loss, where ai £ M p , bi £ {1, —1} and 


z = 


7, — biZ 

|(1 - bizf 


if biz > 1 
if biZ < 0 
otherwise. 


(3) 


which corresponds to the smooth support vector machine problem. 


In fact, if appropriate reformulation is conducted, many other problems can also be reduced to ([Tj), 
for example, the multi-task large margin nearest neighbor metric learning (MT-LMNN) problem 


(See Section 5.3). 

The commonly used regularization terms include the ^-regularization gj{x) = with A > 0 
and £2 + ^i-regularization gj(x) = + Ai|x| with Ai, A 2 > 0 . 

We often call ([Tj) the primal problem and its conjugate dual problem is 


max { D(y) = —g* ( — —^ 
y£R n I v ; V n 


n 


$ ( v* 


(4) 


i— 1 


where A = [ai, a^, ■ ■ ■, a n ] T £ M nxp and (p* and g* are the convex conjugates of (pi and g, respec¬ 
tively, meaning that g*(v) = inax, ie gp (u, v) —g{u) and (a) = max^g a/3 — cpi{/3). It is well-known 
in convex analysis that, under Assumption1 1 1 g* is ^-smooth and cp* is 7 -strongly convex. In this 
paper, instead of considering purely (JTj) or (jdj), we are interested in their associated saddle-point 
problem: 

f 1 1 n ] 

min max < q(x) -|— y T Ax -z • (5) 

xeRpycR™ j yv n K '{ 


2 





Let x* and y* be the optimal solutions of 0 and 0 > respectively. It is known that the pair (x*, y*) 
is a saddle point of ([5]) in the sense that 


x* = argmin \ g(x) + ~{y*) T Ax -V <t>*(y*) \ , 

_TTT) n I 77, 71 Z ^ I 


xeRp 


n 


2—1 


y* = argmax < y(x*) + -y T Ax* (y*) > . 

In n I 


( 6 ) 

(7) 


The contributions of this paper can be highlighted as follow: 


• We propose a doubly stochastic primal-dual coordinate (DSPDC) method for solving prob¬ 
lem ([5]) that randomly samples q out of p primal and m out of n dual coordinates to update 
in each iteration. 


• We show that DSPDC method generates a sequence of primal-dual iterates that linearly 
converges to (x*,y*) and the primal-dual objective gap along this sequence also linearly 
converges to zero. 

• We generalize this approach to bilinear saddle-point problems with a block-wise decomposable 
structure, and show a similar iteration complexity for finding an e-optimal solution. 

• We show that the proposed method has a lower overall complexity than existing coordinate 
methods when either the data matrix has a factorized structure or the proximal mapping on 
each block is computationally expensive, e.g., involving an eigenvalue decomposition. 

• Our experiments confirm the efficiency of DSPDC on both synthetic and real datasets in 
various scenarios. A notable application is the multi-task large margin nearest neighbor (MT- 
LMNN) metric learning problem. 


Notation Before presenting our approach, we first introduce the notations that will be used 
throughout the paper. Let [d] represent the set {1, 2,..., d}. For v E M rf , let Vi be its z-th coordinate 
for i E [d] and vj be a sub-vector of v that consists of the coordinates of v indexed by a set I C [d]. 
Given an n x p matrix W, we denote its z-th row and y-th column by Wq and W J , respectively. For 
I C [n] and J C [p], the matrices Wj and W J represent sub-matrices of W that consist of the rows 
indexed by I and columns indexed by J, respectively. We denote the entry of W in z-th row and 
y-th column by W- and let Wf be sub-matrix of W where the rows indexed by / intersect with 
the columns indexed by J. 

Let (•, •) be the inner product in a Euclidean space, || • || be the ^-norm of a vector and || • H 2 
and || ■ ||p be the spectral norm and the Frobenius norm of a matrix, respectively. For integers 
q E \p\ and m E [re], we define A 9jm as a scale constant of the data as follows 


A 


q,m 


max 11 A'j 11 o. 

IC[n],Jc]p],\I\=m,\J\=q 


( 8 ) 


The maximum i 2 norm of data points is therefore y / A p q. The condition number of problems |l| , (J4J) , 
and ([5]) is usually defined as 

K=^-, (9) 

A7 

which affects the iteration complexity of many first-order methods. 
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Paper Outline The rest of this paper is organized as follows. Section [2] briefly introduces the 
existing work that are related to ours. Section [3] immediately summarizes the results of this paper, 
along with rough comparisons against the existing methods. In Section [4j we propose the DSPDC 
algorithm, followed by its theoretical convergence analysis and an efficient implementation for 
factorized data. The algorithm is extended to the block coordinate update scheme in Section [5j 
which can be applied to two important problems including the multi-task large margin nearest 
neighbor. We conduct empirical studies in Section [ 6 ] to confirm the efficiency of the proposed 
method, and conclude our paper in Section [7j All the proofs are deferred to appendix. 


2 Related Work 


To find an e-optimal solution of problem ([!]), Q or the overall complexity of an iterative 
method is defined as the per-iteration computational cost multiplied by the total number of required 
iterations (called iteration complexity) 

(2004), 


2005 , Nemirovski 


Deterministic first-order methods such as Nesterov 

12014 


2004 


Chambolle and Pock 2011 , Yu et al 


have to compute a full 

gradient in each iteration by going through all p features of all n instances at a per-iteration cost 
of 0(np), which can be inefficient for big data. Therefore, stochastic optimization methods that 
sample one instance or one feature in each iteration become more popular. There are two major 
categories of stochastic optimization algorithms that are studied actively in recent years: stochastic 
gradient methods and stochastic coordinate methods. The DSPDC method we propose belongs to 
the second category. 

Recently, there have been increasing interests in stochastic variance reduced gradient (SVRG) 


methods Johnson and Zhang, 

2013, 

Xiao and Zhang 

2014, 

Nitanda 

2014, 

Konecny and Richtarik( 

2013, 

Allen-Zhu, 

2016 . SVRG runs in multiple stages. At each stage, it computes a full gradi- 


ent and then performs O(k) iterative updates with stochastic gradients constructed by sampled 
instances. Since the full gradient is computed only once in each stage, SVRG has a per-iteration 
cost of 0(p), which is lower than deterministic gradient methods, and it needs 0 ((n + k) log(l/e)) 
iterations to find an e-optimal solution for problem 0. so that the overall complexity of SVRG is 


0(fnp+up) log(l/e)). Recently, an accelerated SVRG method, named Katyusha Allen-Zhu 2016 


further reduces the iteration complexity of SVRG to 0({n + y/ntf) log(l/e)) while maintains the 
0(p) per-iteration cost so that it achieves an overall complexity of 0((np + y/nnp) log(l/e)). The 
aforementioned overall complexities are obtained when a uniform sampling scheme is applied in 
the construction of stochastic gradient. One can further reduce the k term in these complexities by 
using a non-uniform sampling scheme as pointed out, for example, by Xiao and Zhang 2014 . How¬ 


ever, in this paper, the complexity of each algorithm we present and compare is based on a uniform 
sampling scheme unless stated otherwise. After the earlier version of our draft was posted online^ 
Balamurugan and Bach 120161 developed an accelerated SVRG method (ASVRG-SP) for solving 

the saddle-point formulation (5), which has a complexity 2 of 0((np + n p\j ) log(l/e)) 


by uniform sampling and 0((np + y/np 


max{n,p}||.4||] 

Vy 


)log(l/e)) by non-uniform sampling. Here 


and in the rest of the paper, O contains some logarithmic factors. 


Stochastic incremental gradient methods 

Schmidt et al. 

2013 

Roux et al. 

2012 , 

Defazio et al. 

2014ab, 

Mairal 

2015 

Lan and Zhou 

2015, 

Balamurugan and Bach, 2016 is 

also widely studied 


1 https://arxiv.org/pdf/1508.03390v2.pdf 

J This complexity is achieved by the individual-split version of ASVRG-SP. 
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in recent literature. Different from SVRG, stochastic incremental gradient method computes a full 
gradient only once at the beginning, but maintains and updates the average of historical stochas¬ 
tic gradients using one sampled instance per iteration. Standard stochastic incremental gradient 

2012, Defazio et al., 2014a! b 


methods Schmidt et ah, 2013, Roux et ah 


Mairal, 2015 have a 


per-iteration cost of 0(p) just as SVRG and need 0((n + n) log(l/e)) iterations to find an e-optimal 
solution so that their overall complexity is the same as SVRG. Moreover, an accelerated stochastic 
incremental gradients method, named RPDG |Lan and Zhou, 2015], achieves an iteration complex¬ 
ity of only 0((n + y/rm) log(l/e)) and a per-iteration cost of 0(p) so that its overall complexity is 
the same as Katyusha. The iteration complexity of RPDF and Katyusha is proved to be optimal 
by Lan and Zhou 12015 . 

In contrast to stochastic gradient methods, stochastic coordinate method works by updating 


randomly sampled coordinates of decision variables Nesterov 

2012 , 

Richtarik and Takac 

2014 

Shalev-Shwartz and Tewari 2009, Fercoq and Richtrik, 2013, Lu and Xiao, 2015, Dang and Lan 

2014, Lin et al. 2015, Deng et al. 2015, Allen-Zhu et al. 2016 Nesterov and Stich, 2016, Qu anc 

Richtarik, 

2016b a 

Richtarik and Takac[ 

2016], 

Shalev-Shwartz and Zhang 

2013a| 

b|a proposed a 


ascent (SDCA) method to solve the dual formulation (|4j). SDCA has 
an iteration complexity of 0({n + k) log(l/e)) and has been further improved to the accelerated 
SDCA (ASDCA) method [Shalev-Shwartz and Zhang 2013b| that achieves an iteration complexity 
of 0((n + yJnK) log(l/e)). The optimal iteration complexity 0((n + y/mc) log(l/e)) is obtained by 
the accelerated proximal coordinate gradient (APCG) method |Lin et al., 20151 when it is applied 
to the dual problem ([4 ). Extending the deterministic algorithm by |Chambolle and Pock 201 1| for 
saddle-point problems, Zhang and Xiao 12015 recently proposed a stochastic primal-dual coordinate 


(SPDC) method for ([5]), which alternates between maximizing over a randomly chosen dual variable 
and minimizing over all primal variables and also achieves the optimal 0((n + y/rm) log(l/e)) 
iteration complexity. The per-iteration cost is 0(p ) in all of these coordinate methods. Note 
that, when applied to the primal problem 0- APCG samples a feature of data in each iterative 


iterations, which is also optimal according to Lan and Zhou 

2015]. 

Some recent works Zhao et al. 2014, Dai et al., 2014, Konecny et al. 2014, Matsushima et al. 

2014, Dang and Lan, 2015 Deng et al. 

2015 made attempts in combining stochastic gradient and 

stochastic coordinate. Zhao et al. 2014 

, Matsushima et al. 

2014 , Dang and Lan |2015 proposed 


randomized block coordinate methods, which utilize stochastic partial gradient of the selected block 
based on randomly sampled instances and features in each iteration. However, these methods 
face a constant variance of stochastic partial gradient so that they need 0(l/e) iterations. These 
techniques are further improved in Konecny et al. 2014 , |Zhao et al. [2014] with the stochastic 
variance reduced partial gradient but only obtain the sub-optimal 0((n + «)log(l/e)) iteration 
complexity. 


3 Summary of Results 

Although the aforementioned stochastic coordinate methods have achieved great performances 
on ERM problem ([5]), they either only sample over primal coordinates or only sample over dual 
coordinates to update in each iteration. Therefore, it is natural to ask the following questions. 

• What is the iteration complexity of a coordinate method for problem 0 that samples both 
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primal and dual coordinates to update in each iteration? 


When is this type of algorithm has a lower overall complexity than purely primal and purely 
dual coordinate methods? 


To contribute to the answers to these questions, we propose the DSPDC method in Section [4] 
that samples over both features and instances of dataset by randomly choosing the associated 
primal and dual coordinates to update in each iteration. 

To answer the first question, we show in Theorem [l] and [2] that, if q primal and m dual co¬ 
ordinates are uniformly sampled and updated in each iteration, the number of iterations DSPDC 

needs to find an e-optimal solution for (5) is + max{^, |}) log(l/e)). This iteration 

complexity is interesting since it matches the optimal 0((n + ^/nK)\og{l/e)) iteration complexity 
of du al coordinate methods |Shalev-Shwartz and Zhang, 2013b, Lin et ah, 2015[ Zhang and Xiao 


2015 


when ( q,m) = (p, 1), and also matches the optimal Q((p + pyj log(l/e)) iteration com¬ 


plexity of primal coordinate methods Lu and Xiao, 2015[ [Lin et al. |2015] when ( q,m ) = (1 ,n) 


In Section [5] we further generalize DSPDC and its complexity to a bilinear saddle-point problem 
with a block-wise decomposable structure. 

To study the second question, we compare different coordinate algorithms based on the overall 
complexity for finding an e-optimal solution. For most ERM problems, the per-iteration cost of 
SPDC is 0(p) and its the overall complexity is 0(np + p^/rm log (1/e)). When ( q,m) = (1,1) and 
without any assumptions on the sparsity of data, the per-iteration cost of DSPDC is 0(min{n,p}) 
due to a full-dimensional inner product in the algorithm. If n > p, which is true for most ERM 

problems, the overall complexity of DSPDC becomes 0((np + yj n ^’ 1 p 2 ) log(l/e)), which is not 

lower than that of SPDC in generaj^] Nevertheless, we identify two important cases where DSPDC 
has a lower overall complexity than SPDC and other existing coordinate methods. 

The first case is when data A has a factorized structure , namely, A = UV with U E M nxrf , 
V E M. dxp and d < min{n,p}. The ERM problem with factorized data arises when (random) 
dimension/instance reduction or matrix sketching/factorization techniques are applied to A in 
order to reduce the storage and computational cost. More examples are provided in Section 4.2 In 


this case, choosing ( q , m .) = (1,1) and using an efficient implementation, our DSPDC has an overall 
complexity of 0((nd+ pd) log(l/e)), better than the 0((npd+ yfmpd) log(l/e)) complexity 

of SPDC with the same efficient implementation. See Table [l] for comparisons with more existing 
techniques for this class of problems. 

The second case is when solving a block-wise decomposable bilinear saddle-point problem 
where the proximal mapping on each block is computationally expensive. The applications with 
this property include trace regression [Slawski et ah 2015 and distance metric learning [Wein- 


berger and Saul, 2008, 2009, Parameswaran and Weinberger, 2010 , where each block of variables 


needs to be a d x d positive semi-definite matrix so that the proximal mapping involves an eigen¬ 
value decomposition with a complexity of 0(d 3 ). When ( q,m ) = (1,1) and n > p , DSPDC 
requires solving eigenvalue decomposition only for one block of variables so that its overall com¬ 
plexity is 0((d 3 + pd?)(n + log(l/e)) as shown in Section 5 I which is lower than the 

0((npd 3 + x/rmpd 3 ) log(l/e)) overall complexity of SPDC when 'J~K\^p < A P) \d. See Table [ 2 ] for 
comparisons with more existing techniques for this class of problems. 


3 Note that A„ 1 > Ai 1 > ^e±L and k > > -. 


— A7 — p ■ 
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Algorithm 1 Doubly Stochastic Primal-Dual Coordinate (DSPDC) Method 
Input: i) _ x (o) _ x (o) £ y( i) _ y( o) _ y( o) g anc [ p aram eters ( 9 , r, a). 

Output: x^ and y^ 

1: for t = 0,1, 2,..., T — 1 do 

2 : Uniformly and randomly choose I C [n] and J C [p\ of sizes m and q, respectively. 

3: Update the primal and dual coordinates 




y(t + l) 


T (t+1) 

x (f+1) 


| argmax^ eR ^ - yf } ) 2 } if * € J, ^ 

y(t) + ™(y(t+l)_ y (t))' (11 ) 

m 

f argmin aeR {\(A^y^)a + gj (a) + £(a - x^) 2 } if j G J, 

1 W J ., . , , ( 12 ) 

[ ^ if j i J , 

+ (0 + l)(x ( - i+1 - ) — x®). (13) 


4: end for 


Although it is not our main focus, we note that applying a non-uniform sampling on the 
primal and dual coordinates can further reduce the overall complexity of our DSPDC just as other 


coordinate methods Zhang and Xiao 

2015, 

Csiba and Richtarik, 

2016 

Richtarik and Takac 

2016] 

Csiba et ah, 

2015, 

Qu and Richtarik, 

2016b a, Allen-Zhu et al., 2016 . 


4 Doubly Stochastic Primal-Dual Coordinate Method 


4.1 Algorithm and Convergence Properties 

In this section, we propose the doubly stochastic primal-dual coordinate (DSPDC) method 
in Algorithm [I] for problem ([E]. In Algorithm [lj the primal and dual solutions (x^ +1 \ y( t+1 )) 
are updated as ( ]~i~2j ) and ( fTo] ) in the randomly selected q and m coordinates indexed by J and 
/, respectively] These updates utilize the first-order information provided by the vectors Ax^> 
and A 1 y( t+l ^ where (x^,y(H-!)) are updated using the momentum steps (13) and ( 11 ) which are 
commonly used to accelerate gradient (AG) methods [Nestero v, 2004, 2005J — Algorithm [I] requires 
three control parameters 0, r and a and its convergence is ensured after a proper choice of these 
parameters as shown in Theorem [l] The proofs of all theorems are deferred to the Appendix. 


Theorem 1. Suppose 6, t and a in Algorithm^ 7] are chosen so that 

p/q __ nmq p 


e = p - 


l_A_np , i - n P4 

A-vn ma ' mdJt 4 m i c > 


T(J = 


+ P ~ = 


ApA ' 2q\r q 


n n 
2m r ya m 


(14) 


4 Here, we hide the dependency of I and J on t to simplify the notation. 
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where A is any constant such that A > A qm . For each t > 0, Algorithm^ 7] guarantees 
+ EH ®* - z «|| 2 + E || s ,* - yW || 2 

\2gr q J V4 ma mJ 


< 1 - 


max 


/ V n\ 
\q' m j 


A np 
A 7 n mq , 


+ IU* _ £(0)112 + ( _Jf_ 4 - T\ ||„* _ 7 ,( 0 ) 112 


2 qr + q 


( 2 W + m) 


Remark 1 For a given A, the values of r and cr can be solved from the last two equations of 0 
in closed forms: 


pin p 


r = At --- +\ --- + 


qA 1 \m q 


n p\ 4(?rp) 2 A 


n 


<7 = 


7717 \ m 


— I + WI —-1 + 


m q J (m,q) 2 n\^ 

n p\ 2 4(np) 2 A 




(mg) 2 nA 7 


(15) 

(16) 


which are referred to as the primal and dual step size, respectively. If both primal and dual coordi¬ 
nates are sampled at the same ratio, i.e., p = r ^, then we have the following simplified version: 


m y 
= ~2~ V AnA’ 


a = 


m 

2 



(17) 


According to the convergence rate above, the best choice of A is A q ^ m . Although the exact compu¬ 
tation of A q)m by definition |A|) may be costly, for instance, when q « | or m ss |, it is tractable 

when q and m are close to 1 or close to p and n. In practice, we suggest choosing A = ,nqh P ’ 1 as 
an approximation of A q)m , which provides reasonably good empirical performance (see Section [(|). 

Besides the distance to the saddle-point ( x*,y *), a useful quality measure for the solution 
(a;(*), s/(*)) is its primal-dual objective gap, P(x^) — D{y^), because it can be evaluated in each 
iteration and used as a stopping criterion in practice. The next theorem establishes the convergence 
rate of the primal-dual objective gap ensured by DSPDC. 


Theorem 2. Suppose r and a are chosen as (If) while 6 is replaced by 


e = p -- 

q 


p/q 


2 A m+2 rn p} 
\ A-vn ma o - 1 


A 7 n mq ' m 7 q 

in Algorithm [7| For each t > 0, Algorithm [I] guarantees 
E [P(x (t) ) - 

1 


< 


2 / A np + 2 m ax{^,2) 

\ A -yn ma m‘ a J 


1 o ,_ 

X’jn mq 

P pA 
2qr 2 q 


H 

1 1 

3 

% 

sE 

to 

iaii 2 \ i 

An 2 J [ 

1 min / -, — ] 

• min { At 

“I 1 


L 1 Q ’ m J 

\ « ’ 

m J J 


(18) 


X(0) ~ ^ + ( 2 ma + 2m) ^ “ V *^ + max { “> fn } ~ D{yW) ) 


























According to Theorem [l] and [2] in order to obtain a pair of primal and dual solutions with an 
expected e distance to (x*,y*), i.e., E[||x^ — a;*11 2 ] < e and E[||y^ — y*|| 2 ] < e, or with an expected 
e objective gap, Algorithm |T] needs 


t = O 



iterations when A = Ag )m . This iteration complexity is interesting since it matches the optimal 


0((n + y/mi) log Q)) iteration complexity of dual coordinate methods such as SPDC Zhang and 
Xiao, 2015 and others |Shalev-Shwartz and Zhang[ 2013b, Lin et ah, 2015 when (q,m ) = (p7v, 


and also matches the optimal 0((p + log Q)) iteration complexity of primal coordinate 

methods jLu and Xiao 2015, Lin et al.[ 2015] when (g, m) = (l,n). 

To efficiently implement Algorithm [lj we just need to maintain and efficiently update either 
Ax W or A T y ^ t ' ) , depending on whether ^ or | is larger. If ^ > |, we should maintain A T y W during 
the algorithm, which is used in (12) and can be updated in 0(mp ) time. We will then directly 
compute ( Ai,x w) for i g 1 in (10) in 0{mp ) time. In fact, this is how SPDC is implemented 
in Zhang and Xiao 2015 where q = p. On the other hand, if ^ ^, it is more efficient to maintain 

Ax^ and update it in 0(qn) time and compute (A- 7 , for j E J in (12) in O(qn) time. Hence, 

the overall complexity for DSPDC to find an e-optimal solution is 0((np + \J n ^ m y) log (^)) 


when ^ > 2 anc [ 0((np + log (^)) when ^ < |. Since the overall complexity of 

SPDC is O ((up + yjunmp ) log Q)) when ^ DSPDC method is not more efficient for general 
data matrix. However, in the next section, we show that DSPDC has an efficient implementation 
for factorized data matrix which leads to a lower overall complexity than SPDC with the same 
implementation. 


4.2 Efficient Implementation for Factorized Data Matrix 

In this section, we assume that the data matrix A in (|5j) has a factorized structure A = UV where 
U E W nxd and V E M. dxp with d < min{n,p}. Such a matrix A is often obtained as a low-rank or 
denoised approximation of raw data matrix. Recently, there emerges a surge of interests of using 


factorized data to alleviate the computational cost for big data. For example, Pham and Ghaoui 
20151 proposed to use a low-rank approximation X ~ UV = A for data matrix X to solve multiple 
instances of lasso problems. For solving big data kernel learning problems, the Nystrom methods, 


that approximates a n x n kernel matrix K by US^U T with U E 


p nxd 


S E 


p dxd 


and d < n, 


has become a popular method [Yang et al. 2012 . Moreover, recent advances on fast randomized 
algorithms Halko et al. 2011 for finding a low-rank approximation of a matrix render the proposed 


coordinate optimization algorithm more attractive for tackling factorized big data problems. 

The factorized A also appears often in the problem of sparse recovery from the randomized 
feature reduction or randomized instance reduction of (|I|. The sparse recovery problem from 
randomized feature reduction can be also formulated into ([5]) as 


mm max 

a:SKP j/SM" 


^2 I, ||2 . || |i 

y If || 2 + Ai |f||i 


+ \ T XG T Gx-~Y j( t>*{y i )\ 

n n 

i =l ) 


(19) 


where X is the original n X p raw data matrix, G is a d x p random measurement matrix with 
d < p, and the actual data matrix for ^ is A = XG t G with U = XG t and V = G. This 
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Algorithm 

Num. of Iter, (x log(^)) 

Per-Iter. Cost 

Overall Compl. (xlog(^)) 

DSPDC 

»W%r 

d 

nd + pd^J ^ 

SPDC 

ASDCA 

APCG 

RPDG 


pd 

npd + pd^J * A7 

SDCA 

SAGA 

"+^ 

pd 

npd + pd^~ 

SVRG 

Outer: 1 

Inner: 'V ’ 1 

A7 

nd 

pd 

nd + pd 

ASVRG-SP 

Outer: + 1 

Inner: 

nd 

d 

nd + nd^ pmax{ Y M ' n} 


Table 1: The overall complexity of finding an e-optimal solution when A = UV, n > p and U and 
V (but not A) are stored in memory. We choose (q,m) = (1,1) in DSPDC. 


approximation approach has been employed to reduce the computational cost of solving under¬ 
constrained least-squares problem |Mahoney 2011 Wang et al., 2016 . Similarly, the randomized 


Drineas et ah, 2011, Wang et al 


instance reduction 

in (19) with G T GX, where G is a d x n random measurement matrix with d < n 


2016 can be applied by replacing XG T G 

and the data 


matrix A = G T GX with U = G T and V = GX. 

To solve ((H]) with A = UV, we implement DSPDC by maintaining the vectors = U T y( t ' 1 and 
yW = Vx W and updating them in 0(dm ) and 0(dq ) time, respectively, in each iteration. Then, 
we can obtain in (10) in 0(dm ) time by evaluating {'Ui,v for each i G J, where U t is 

the ith row of U. Similarly, we can obtain (Aj,y( t+1 ^ in ( |l2[ ) in O(dq) time by taking (V 3 ,v^^ 
for each j 6 J, where V 3 is the jth column of V. This leads to an efficient implementation of 
DSPDC described as in Algorithm [ 2 ] whose per-iteration cost is 0(dm + dq), lower than the 0(mp ) 
or O(qn) cost when A is not factorized. 

To make a clear comparison between DSPDC and other methods when applied to factorized 
data, in Table[lJ we summarize their numbers of iterations and per-iteration costs (when A = 

For all methods in comparison, we assume A is too large so that only U and V are stored in mem¬ 
ory, which is the typical situation when applying random reduction. Moreover, the aforementioned 
efficient implementation in DSPDC (if applicable) has been also applied to other methods to re¬ 
duce their per-iteration cost. In Table [lj we assume n > p and ( q,m ) = (1,1) and omit all the 
big-0 notations for simplicity. For ASVRG-SP, we present the complexity of its individual-split 
version with uniform sampling. According to the last column of Table [lj our DSPDC with efficient 
implementation has the lowest overall complexity among these methods. 


5 For SVRG and ASVRG-SP, we present their numbers of outer and inner iterations and per-iteration costs 
separately. 
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Algorithm 2 Efficient Implementation of Algorithm [l] for Factorized Data 
Input: 2 ^ x ) = aj(°) = a;( 0 ) G M p , y( = y(°) G M n , and parameters (0, r, cr) 

Initialize: = U T y^°\ v^ = Vx^\u^ = U T y^°\ v = Ux® 

Iterate: 

For t = 0,1, 2,..., T — 1 


Uniformly and randomly choose I C [n] and J C \p\ of sizes m and q, respectively. 


w (m) 

h (m) 


T ( t+1 ) 

x j 

h (m) 


f argmax^jj - ^(/3 - yf } ) 2 } if * G /, 

1 2/f } if i i I, 

« (t) + t/ T (y {m) — y(*)), 

+ -C/ T (2/ (m) - yW), 

m 

f argmin QeR | ± (W, «( t+1 ))a + gj (a) + ^ (a - xf) 2 } if j G J, 
\ xf if j 0 J, 

v® + U(x( t+1 ) -#), 

+ (0 + l)U(x^ +1 i — x^). 


( 20 ) 

( 21 ) 

( 22 ) 

(23) 

(24) 

(25) 


Output: x^ and y^ 


5 Extension with Block Coordinate Updates 

With block-wise sampling and updates, DSPDC can be easily generalized and applied to the bilinear 
saddle-point problem ([5]) with a block-decomposable structure and a similar linear convergence 
rate can be obtained. Although this is a straightforward extension, it is worth showing that, when 
the proximal mapping on each block is computationally expensive, DSPDC can achieve a lower 
complexity than other coordinate methods. In this section, we first extend DSPDC to its block 
coordinate update version, and then identify the scenarios where such an extension has a lower 
overall complexity than other methods. 


5.1 Algorithm and Convergence Properties 

We partition the space M p into p subspaces as M p = M 91 x M 92 x ■ ■ ■ x M 9 ? such that Y^j=i Qj = P an d 
partition the space M n into n subspaces as R n = M mi x M m2 x ■ ■ ■ x R m " such that Yli= l = 
With a little abuse of notation, we represent the corresponding partitions of x G M p and y G M n 
as x = (xi, x 2 ,..., x p ) with xj G for j = 1,... ,p and y = (yi, y 2 , • ■ •, y n ) with y* G M mi for 
i = 1 ,..., n, respectively. 

We consider the following bilinear saddle-point problem 


min max 

xeRP yeR n 


+ ^y TAx 

I ' 

J =1 


1 

n 


Y (y<) 

2—1 



(26) 


where gj : M 9j ' —> M and (/)* : M m * -4 M are functions of Xj and y,;, respectively. Moreover, we assume 
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gj and <f>* are strongly convex with strong convexity parameters of A > 0 and 7 > 0, respectively. 
Due to the partitions on x E and y E M n , we partition the matrix A into blocks accordingly so 
that 

p n 

y 1 Ax A i x i- 

j =1 i =1 

where Aj E M. miXq i is the block of A corresponding to xj and y ? ;. 

It is easy to see that the problem ([5]) is a special case of (26) when q 3 = to,; = 1 for j = 1 ..... p 
and i = 1,... ,n, p = p and n = n. The scale constant defined in ([8]) can be similarly generalized 
as 


Aq : m — 


max 

IC[n],Jc[p],\I\=m,\J\=q 


I A/1 


(27) 


where A j is sub-matrix of A consisting of each block Aj with i E I and j E J. 

Let A i = (A), ■ • • , A?) and A J = ((A{) r , • • • , (A J n ) T ) T . Given these correspondings between 
© and ([26]), DSPDC can be easily extended for solving 


,(*+i) 


x 


(t+i) _ 


argmax i9eIR . 




by replacing (|10|) and (|12|) with 

±B T AvxW-ffl-illfl- 


2cr I 


yf 


W ||2 


} 


if * E J, 
if i $ I, 


(28) 


argmin aeR ^ j^a^A^y^ 1 ) + ^(a) + ^||a - xf } || 2 } if j E J, 


r (t) 


if j i J , 


(29) 


respectively, and y ( ^ and x^i are updated in the same way as (11) and (13). 


For this extension, the convergence results similar to Theorem [l] and Theorem [2] can be easily 
derived with almost the same proof. We skip the proofs but directly state the results. To find 


a pair of primal-dual solutions for (26) which either has an e-distance to the optimal solution or 


has an e-primal-dual objective gap, the number of iterations Algorithm [I] (with by (10) and (12) 
replaced by (28) and (29)) needs is 


t = O 



5.2 Matrix Risk Minimization 

In this section, we study the theoretical performance of DSPDC method when the block updating 
step (28) or (29) has a high computational cost due to eigenvalue decomposition. Let §1 be the 


set of d x d positive semi-definite matrices. The problem we consider is a general multiple-matrix 
risk minimization which is formulated as 


^. 


nun 


(30) 


where is a d x d data matrix, fa is (l/7)-smooth convex loss function applied to the linear pre¬ 
diction Ylj= i( D i , Xj ) and A is a regularization parameter. The associated saddle-point formulation 
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Algorithm 

Num. of Iter, (x log(^)) 

Per-Iter. Cost 

Overall Compl. (x log(^)) 

DSPDC 

n + P\/ A 7 

pd 2 + d 3 

(■ d 2 p + d 3 ){n + psj ) 

SPDC 

ASDCA 

APCG 

RPDG 

n+s/W 

pd 3 

npd 3 + pd 3 \J ^ 

SDCA 

SAGA 


pd 3 

npd 3 + pd 3 

SVRG 

Outer: 1 

Inner: 

A 7 

pnd 2 
pd 3 

npd 2 + pd 3 

ASVRG-SP 

Outer: + 1 

Inner: 

pnd 2 

d 3 


npd 2 + pnd 2 Au4 


Table 2: The overall complexity of finding an e-optimal solution for (31) when n> p. 


of (30) is 


mm max 
x j £§l,j=i,..., P y^ n 


It 

3 = 1 


IX 


3 \\F 


it f n 

+ - £ Yl, . Xj) (w) 

Z—1 J = 1 Z=1 


(31) 


which is a special case of (26) where q,j = d 2 and m* = 1. xy E 


I >d 2 


and 


vectorization of the matrices Xj and D] respectively, and gj(Xj) = ^ 11 Xj \\ 2 F if Xj E Si 


A] 

2 


olxd 2 


are the 


and 


gj(Xj) = +oo if Xj ^ §+. The applications of this model include matrix trace regression 
et al.[ |2015 and distance metric learning Weinberger and Saul 
Weinberger, |2010| . 


Slawski 


2008, 2009, Parameswaran and 


In Table [2j we compare DSPDC with various methods on the numbers of iterations and per- 
iteration costs when applied to problem (31). We assume n > maxjp, d} and (q,m) = (1,1) 


and omit all the big-O notations for simplicity. For ASVRG-SP, we present the complexity of its 
individual-split version with uniform sampling. When applied to (31) with ( q,m ) = (1,1), DSPDC 


requires solving (29) in each iteration which involves the eigenvalue decomposition of one d x d 
matrix with complexity of 0{d 3 ). To efficiently implement DSPDC, we need to maintain and 
efficiently update either Ax^ or A 1 yW with complexity of 0(d 2 min{ra,p}). When p < n, the 
per-iteration cost of DSPDC in this case is therefore 0(d 3 + pd 2 ) so that the overall complexity for 

DSPDC to find an e-optimal solution of (31) is 0((d 3 +pd 2 ){n+ \Jp) log(l/e)). On the contrary, 
SPDC, ASDCA, APCG and RPDG neeato solve p eigenvalue decompositions per iteration so that 
its the overall complexity is O(pd 3 (n + yj ^y^) log(l/e)) which is higher than that of DSPDC when 

yjA\ t ip < y/A~[d. Without this condition, according to the last column of Table [ 2 J DSPDC still 
has a lower overall complexity than SDCA, SAGA, SVRG and ASVRG-SP. 


5.3 Multi-task Large Margin Nearest Neighbor Problem 


In this section, we show that DSPDC can be applied to the Multi-task Large Margin Nearest 
Neighbor (MT-LMNN) problem Parameswaran and Weinberger 2010|. The key is to appropriately 
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reduce the original form to the matrix risk minimization (31). 


Problem Reformulation To make the paper self-contain ed, we include the introduction of MT - 
LMNN here. Interested readers can find more background in |Parameswaran and Weinberger 2010 


Suppose there are p > 1 ta sks, each being a multi-class classification problem. For example, in 
our empirical study (Section |6.3[ ), we have p = 100 tasks, each being a 10-class image classification 
problem. MT-LMNN aims to learn one Mahalanobis distance metric (defined as a positive semi- 
definite matrix) for each task, so there are totally p metric matrices to be learned. With those 
metrics, the label of a testing point in task j is determined by the majority vote of its ^-nearest 
neighbors defined by the j -th metric. We further assume that the tasks are correlated that all the 
metrics share a common component, in addition to their own matrix. The original formulation of 
MT-LMNN is the following: 


mm 

N>£ s + ,i=o.i,...,p,feR r * 


s.t. 


Y\\Xo-I\\ 2 f 


+ EyH* 

j =i 


ill f 


dj(z u1 Z w ) dj(z u ,Z v ) — 1 ^uvwt 

> 0 , V(u, v , w) G Sj. 


1 p 

- E E “j 

3 —1 ( u,v)eAfj 


3 

^\uvw 


dj (Zu ) Zy ) + ^ ^ ^ £,uvw 

3= 1 ( u,v,w)€Sj 

V? e [p], V(u,v,w) e Sj ( 32 ) 


Now we interpret the notations above. We let z denote a training data point indexed by subscript 
u,v,w etc, Xj G S+ be the metric matrix for task j = 1,2, ...,p and Xq the common component 
shared by all the tasks to reflect the correlations among them. Let J\Tj be the set of every ordered 
pair (u, v ) for task j such that z v is among the l closest points of z u that has the same label as z u , 
and Sj be the set consisting of the triples (u, v, w) such that (u, v ) G Mj and z w is the closest point to 
z u that has a different label. The aforementioned closeness can be measured in Euclidean distance 
or other appropriate methods in the original feature space. We use n := Yl P j=i \-^j\ = Yl P j =i \Sj\ 
to denote the total number of constraints in (32) excluding the non-negativity constraints. Let 
Z j tUV := (z u - z v )(z u - z v ) T for all (u,v) G A/} and Z,- 


^J^UVW 


:= Z 


l J,UW 


- z 


S,uv 


for all (u, v, w) € Sj so 


that the distance dj(z u , z v ) in task j is defined as 


dj(z u , Zy') - \J ( Z U Zy^ (Xj T Aq ){z u Zy) - (Z jY 


UV 1 


xj + X 0 


Note that the metric matrix for task j is Xj + Xq, the sum of the individual matrix Xj and the 


shared component Xq among all the tasks. We can see that in each task, the goal of formulation (32) 


is essentially to minimize the distances of points with the same label (the objective) while enforcing 
the points with different labels to stay away from each other (the constraints). The slack variables 
Cuvw allow for soft constraints in the problem. The regularization term \j\\Xj 
the magnitude of Xj and Aq|| W — I \\ 2 F tunes how close Xq to the identity /. 


X,-|| p,Vj G (p\ controls 


Following the same convention of support vector machine, we can transform the problem (32) 
to an unconstrained form: 


nun 


Xo,...,Jf p eSl 


E||X 0 _/||2, + £^|| A v||^ + l£ £ (Zj, uv ,X j + X 0 ) (33) 


j=i j =i (u,v)eAfj 

1 p 

+vX E «< z j,UVW ) Xj+X o)). 

j =1 ( u,v,w)eSj 

where </>(•) is the hinge loss and we adopt its smoothed version ([3]) with 6 = 1. 
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By introducing the dual variable y we can obtain the following equivalent saddle-point formu¬ 
lation of(33): 


nun max 

x 0 ,...,x p eS^ yeR" 


^0 II v r 11 2 

T — 1 F 


p \ . | P 

+ Evii^iit + i 


3 = 1 


n 




e 


Xj + x 0 


(34) 


1 " 


H— 
n 




E 




3 =1 ( u,v)£jVj 

1 


j ( ^j,uvw j Xj T A o ) ^ ^ 


j=i ( u,v,w)eSj 


V 

n z 

j=l ( u,v,w)£Sj 


4>*(yj,v 


Here, each dual variable yj tU vw corresponds to the matrix Z j_ uvw and the constraint d 2 (z u ,z w ) — 
d 2 (z u ,z v ) > 1 Cuvw hr (J32|) for all j £ [p] and ( u , v , re) £ We stack all the dual variables yj : uvw 
into a single column vector y £ M n and y s represents the sth coordinate of y. Let T(s) and Z s 
represent the task and the outer product y s corresponds to, namely, T(s) = j and Z s := Zj jUW —Zj^ uv 
if the new index s corresponds to the original index ( j,uvw ). Then we have the following more 
compact formulation: 


P 1 n 1 n 

min max V 9 j(Xj) + - Y] + ^0)- Y]<t>*{Vi), 

x 0 ,...,x p £§f »ei" trf n w n 


(35) 


j=o 


2=1 


2=1 


where gj (Xj) := ^||X,,-||f. + ^(C^Xj), with C ? = E( u ,„)eAG- Z J,«« for 3 e [p]> and 5o(^o) := 
4p||Xo — I||p + h (Co, Xq) with Co = J2j =1 Cj. Now, we have reduced the MT-LMNN problem to 
the form of (31) and the customized method is shown in Algorithm [3j The convergence properties 
similar to Theorem [l] and [2] immediately follow. 


6 Numerical Experiments 


In this section, we conduct numerical experiments to compare the DSPDC method with two other 
popular stochastic coordinate methods, SPDC |Zhang and Xiao 2015] and SDCA | Shalev-Shwartz 
and Zhang 2013a on three scenarios. The first two are empirical risk minimizations, with one 


applied on factorized data (see Section 4.2) and the other using matrices as decision variables (see 


Section 5.2), respectively. Those experiments are run on somewhat synthetic data and serve as the 


first step of sanity check for the convergence speed. The third is a multi-task large margin nearest 


neighbor metric learning problem (see Section 5.3) on a real dataset. In a nutshell, we show that 
DSPDC outperforms the competitors in terms of running time in all the experiments. 


6.1 Learning with factorized data 

We first consider the binary classification problem with smoothed hinge loss under the sparse 
recovery setting. Besides, we work on a low-dimensional feature space where random feature 
reduction is applied. That being said, we are solving the problem (19) with 4>i(z) given by <©• 

For the experiments over synthetic data, we first generate a random matrix X £ M nxp with 
Xij following i.i.d. standard normal distribution. We sample a random vector (5 £ with /3j = 1 
for j = 1, 2,..., 50 and fdj = 0 for j = 51, 52,... ,p and use j3 to randomly generate 6* with the 
distribution Pr(6j = lj/3) = 1/(1 + e~ x ?P) and Pr(6j = — 1\/3) = 1/(1 + e x ?P). To construct 
factorized data, we generate a random matrix G £ M. dxp with d < p and G/j following i.i.d. normal 
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Algorithm 3 DSPDC Customized for MT-LMNN 

Input: = X° = A 0 E W lxd , = y° = y° £ M n , step sizes r, a, parameter 6, total 


iteration S, sample sizes 1 < m < n and 1 < q < p + 1. 
Output: X s and y s \ 

W° = Z)i:ie[n],r(i)=j fOT i = 2 , -,P and Wg = £ W« = 

j'=i 

= 0 dxrf for j = 1, 2, 

For t = 0,1, 2,. .., S — 1: 




Randomly choose X t C {1,2, ...,n} and /7i C {0,1, ...,p} with |Zt| = m and \ Jt\ = q. Perform 
the following updates: 


„t+1 


B 5 +1 


wf 1 


^ +1 




t+i 


argmaxjf /Zi, A*- (} + aA - £<£*(£) - ^(/3 - 
/3eiR L x w 1 


Vi 

(% 4+1 - j = !. ->P 

i:ieX t ,T(i)=j 


w t + n B t+l + m^i B f if j ^ Q 

3 m 3 m 3 J / i 


E/=i w* 


if j = 0. 


argmin{i(W5 +1 ,Q)+^((5) + i||<5 - Aj|||) 
QeS^ 1 J 




A* + (0 + 1)(A| +1 - A{), Vj = 0,1, 


y*) 2 } ifiGXt, 
if * ^ X* 


if j G Jt, 


(36) 

(37) 

(38) 

(39) 

(40) 


distribution A7(0, 1/d). Then, the factorized data A = UV for (|Tj) is constructed with U = A G T 
and V = G. 

To demonstrate the effectiveness of these three methods under different settings, we choose 


different values for (n,m,p, q, d) and the regularization parameters (Ai, A2) in (19). The numerical 
results are presented in Figure [l] with the choices of parameters stated at its bottom. Here, the 
horizontal axis represents the running time of an algorithm while the vertical axis represents the 
primal gap in logarithmic scale. According to Figure [lj DSPDC is significantly faster than both 
SPDC and SCDA, under these settings. 

We then conduct the comparison of these methods over three real dataset^j Covtype (n = 
581012, p = 54), RCV1 (n = 20242, p = 47236), and Real-sim (n = 72309, p = 20958). We still 


consider the sparse recovery problem from feature reduction which is formulated as (19) with fa 
defined as <©■ In all experiments, we choose d = 20 to generate the random matrix G and set 
Ai = 10 , A2 = 10 -2 in (19). We choose m and q so that n and p can be either dividable by them 


or has a small division remainder. The numerical performances of the three methods are shown 
in Figure [2] In these three examples, SPDC and DSPDC both outperform SDCA significantly. 
Compared to SPDC, DSPDC is even better on the first two datasets and has the same efficiency 
on the third. 


( http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html 
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Figure 1: For all cases, m = 1. First row: Ai = 1CT 3 , A2 = 1CT 2 ; Second row: Ai = 1CT 6 , A2 = 10~ 5 
First column: ( n,p,q,d ) = (5000,100,50,20); Second column: ( n,p,q,d ) = (10000,100,50,50) 
Third column: (n,p, q, d ) = (10000,500, 50, 50). 





Figure 2: Performance on real datasets. Left: Covtype. Middle: RCV1. Right: Real-sim. 
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6.2 Matrix Risk Minimization 


Next we study the performance of DSPDC for solving the multiple-matrix risk minimization prob¬ 


lem (30). We choose 4>i in fl30| to be (pto and generate as a d x d matrix with entry sampled 


from a standard Gaussian distribution for i = 1,2, ...,n and j = 1, 2, 
true parameter matrix Xj as a d x d identity matrix for j = 1,2 , 


..,p. Then we generate the 
Then we use X, and D( to 


generate bi such that bi = 1 if 


(l+exp{ — (Dl ,X,)}) 

set d = 100 or 200, p = 100, n = 100, A = 0.01. 

We compare the performance of DSPDC, SPDC [Zhang and Xiao 


> 0.5 or bi = — 1 otherwise. In this experiment, 


we 


2015 and SDCA |Shalev- 


Shwartz and Zhang, 2013a| with various sampling settings, and the results are shown in Fig [3j It 


can be easily seen that DSPDC converges much faster than both SPDC and SDCA, in terms of 
running time. The behaviors of these algorithms are due to the fact that, in each iteration, both 
SPDC and SDCA need to take p eigenvalue decompositions of d x d matrix while DSPDC only 
needs q such operations. Since the cost of each eigenvalue decomposition is as expensive as 0(d 3 ), 
the total computation cost saved by DSPDC is thus significant. 








Figure 3: Performance on matrix risk minimization. First row: d = 100. Second row: d = 200. 
Left: (m,q) = (5,50). Middle: (m, q) = (50,5). Right: (m, q) = (20,20). 


6.3 Multi-task Large Margin Nearest Neighbor Problem 


Finally, we compare the performance of different algorithms on the MT-LMNN problem (35). The 
dataset we are using is Amsterdam library of objects aloQ a collection of 108,000 images for small 
objects with 1,000 class labels. Each image contains one small object which can be expressed as an 
extended color histogram of d = 128 dimensions. We adopt the approach similar to |Parameswaran 


and Weinberger, 2010 to generate classification tasks. More specifically, we divide the class labels 


'https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html#aloi 
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into 100 pieces, each having 10 labels. In other words, we have 100 metric matrices to learn during 
training, as well as the one shared by all the tasks. The neighborhood size is £ = 3. For each 
task, we randomly select 60% of the data for training, resulting in a training set of 63936 instances. 
Under this setting, the total number of triplet constraints is n = 1142658, which is also the number 
of dual variables. We set Ao = 0.01, Ai = ■ • ■ = \ p = 0.1. 

The comparison is again between DSPDC, SPDC and SDCA under different sampling schemes, 
which is shown in Figure [4} We can observe the similar trends as Figure [3} In particular, DSPDC 
converges much faster to the optimal solution in terms of running time than both SPDC and SDCA, 
under all the sampling settings. The superiority of DSPDC in terms of running time is again due to 
the much less eigenvalue decomposition it does per iteration, which is the benefit brought by primal 
sampling. Indeed, as both SPDC and SDCA need to do full primal coordinate update, they have 
to carry out | times more eigenvalue decompositions than DSPDC. While all those methods have 
similar linear convergence rates, the computational cost per iteration dominates the performance. 




Figure 4: The result of different methods on large margin multi-task metric learning problem with 
ALOI data. There are p = 100 tasks and thus p + 1 = 101 metric matrices to be learned, each 
being 128 x 128. The dual variable (triplet constraint) size is n = 1142658. For left to right, the 
primal and dual sampling sizes of DSPDC are respectively ( q,m ) = (20,2000), ( q,m ) = (20,4000) 
and (g, m) = (40,8000). For SPDC and SDCA, the dual sampling sizes are the same as DSPDC 
while they conduct full primal coordinate update (q = 101). 


7 Conclusion 

We propose a doubly stochastic primal dual coordinate (DSPDC) method for bilinear saddle point 
problem, which captures an important class of regularized empirical risk minimization (ERM) 
problems in statistical learning. We establish the iteration complexity of DSPDC for finding a pair 
of primal and dual solutions with a e-distance to the optimal solution or with a e-objective gap. 
When applied to ERM with factorized data or matrix variables with costly prox-mapping, such as 
the multi-task large margin nearest neighbor metric learning problem, our method achieves a lower 
overall complexity than existing coordinate methods. 
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A Convergence Analysis 

In this section, we provide the detailed proof of the main theoretical results in Section |4j 


A.l Some technical lemmas 

In order to prove Theorem [lj we first present the following two technical lemmas which are extracted 
but extended from Zhang and Xiao 120151. In particular, the second inequality in both lemmas 
are given in Zhang and Xiao 120151 while the first inequality is new and is the key to prove the 
convergence in objective gap. These lemmas establish the relationship between two consecutive 
iterates, and ( IC ( t + 1 ), y(*+ 1 )). 

Lemma 1. Given any x £ M p and v £ M p , if we uniformly and randomly choose a set of indices 
J C {1, 2,... ,p} with | J\ = q and solve an x £ M p with 


. = j argmin aeR { ± Vj a + g 3 {a) + ^(a - Xj ) 2 } if j £ J 

if j i J , 


Xj 


(41) 


then, any x £ we have 


f P . (p ~ g)A 
\2qr 2 q 

> (jL + >^ 

~ \2qr 2 q 


\x - s|| 2 + -—- (g(x) - g(x)) 


P_ 
2 qr 


+ ) E||s — s|| 2 + ^-E||x — s|| 2 + -E (g(x) — g(x)) + — E ( v, x + -(s — x) — x 


-] 

n 


P -l 

q 


and 


> 


( P (p~q ) || *_ |,2 
Ur + 9 ) " " 

/j^ + pA\ ,1^ ^,,2 , P m|| . ^,,2 , aT„.* - , P, 


. , .Ells — s*|| 2 4 -Ells — sir H—E ( v — A 1 y*, x + -(x — x) — x* 

\2qr q J 2qr n \ q 


where the expectation E is taken over J. 

Proof. We prove the first conclusion first. Let s defined as 


. f 1 T r ^ 1 „ „2 

s = argnun < — v x + g(x) +—\\x — x\\ 
ieR" l n 2 t 


Therefore, according to (41), Xj = x 3 if j € J and Xj = Xj if j ^ J. Due to the decomposable 
structure ([2]) of g(x), each coordinate Xj of s can be solved independently. Since g 3 is A-strongly 
convex, the optimality of :ij implies that, for any Xj £ M, 


VjXj 


n 


+ gj{xj) + 


( x j ~ x j) > v j x . 


2 T 


3^3 

n 


+ 9j(xj) + 


( x j - x jT 

2 T 


+ ( 2r + 2 ) _ X ^ 2 ' 
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Since each index j is contained in J with a probability of we have the following equalities 


\2 'it- \2 , f Hi- \2 

E(xj — Xj) = -{Xj — Xj) H- [Xj — Xj) 

E (xj — Xj ) 2 = -(xj — Xj) 2 , 
q~ . P-Q- 

Ex.; = -Xj H- Xj, 

p P 

%i(%) = + 


(43) 

(44) 

(45) 

(46) 


Using these equalities, we can represent all the terms in (42) involving Xj by the terms that only 
contains Xj, Xj and Xj. By doing so and organizing terms, we obtain 


( p , (p-q) x \ / 
V2gr 2y y 1 


Xj ~ x j) 2 + ~—~ {dj( x j) ~~ 9j( x j ))- v j ( + -(Exj — Xj) — Xj 

q n \ q 

( 2 ^ + 2g) ^ + 2^ E ^ J ' “ ^ + g E “ &( X j)) > 


1 


p. 


> 


for any Xj E M. Then, the first conclusion of Lemma [l] is obtained by summing up the inequality 
above over the indices j = 1,... ,p. 

In the next, we prove the second conclusion of Lemma [TJ Choosing X* = x* in (42), we obtain 


VjX j 


+ 9j( x j) + 


(X* - Xj) 2 ^ VjXj 


> 


9j{ x j ) 


(Xj - Xj) 2 


1 A 

27 + 2 


+ I — + W 1 (Xj - Xj 


*\2 


(47) 


n - ' j- 2r n J J 2r 

According to the property Q of a saddle point, the primal optimal solution x* satisfies x* = 
argmin Teffi „. < A(y*) T Ax + g{x) >. Due to the decomposable structure (2) of g(x), each coordinate 


x* of x* can be solved independently. Since gj is A-strongly convex, the optimality of x* implies 


^{A J ,y*)xj + gj(xj) > y*)x* + Sj(x*) + ^(xj - x*) 


*\2 


Summing up (47) and (48) gives us 


7^l(xj - x j) 2 > (7^ + >) (xj - x *) 2 + ^(xj - Xj) 2 + ^(t>j - (A J ,y*))(xj - x*). 


(48) 


(49) 


By equalities (43), (44) and (45), we can represent all the terms in (49) that involve Xj by the terms 
that only contain Xj, Xj and Xj. Then, we obtain 


/ p (p- q)\\ 2 / p p\\ ,. 

(2^ + ) {x > ~ Xi) - (2^ + tJ 


> -^-E(xj — Xj) 2 H—(xj — (A J , y*}) { Xj + -(Exj — Xj) — x 


2 (jT 


n 




Then, the second conclusion is obtained by summing up the inequality above over the indices 
j = l,...,p. □ 
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Lemma 2. Given any v E W 1 and y E W 1 , if we uniformly and randomly choose a set of indices 
I C {1, 2,..., n} with |/| = m and solve an y E M n with 

4>m 


_ I argmax^ 
Ui — \ 


y Vi 

then, any y E M n , we have 

n (n — m) 7 

2 mcr 2 mn 


ififil, 


(50) 


n . 

l|y-y || 2 + -—— (tiiy?) - ^(Vi)) + -^(u,y+—(y-y)-y) 

mn z —' V J n \ m / 

2—1 


> 


and 


(i + i) E|1 ^ 2/1,2 + 

n (n — 771)7 


-112 


1 

+ ^E E («fe. (,+1> ) -«w) 

2=1 


> 


2mcr + mn 
n 7 


llv*-y || 2 


277777 


+ —) E||y - y *|| 2 + —E||y - y\\ 2 - -e/u - Ax*,y + — {y - y) - y*\ . 
mJ 2m<j n \ m / 


where the expectation E is taken over I. 

Proof. The proof is very similar to that of Lemma [lj and thus, is omitted. 


□ 


A.2 Convergence in distance to the optimal solution 

We use Et to represent the expectation conditioned on y^°\x^°\ ... ,y^\x^\ and E t+ the expec¬ 
tation conditioned on y^°\x^°\ ..., y^\ x^\ y( t+1 \ Lemma [l] and Lemma [ 2 ] in the previous section 
provide the basis for the following proposition, which is the key to prove Theorem [lj 

Proposition 1. Let x^\ x^ t+l \ y^ and y( t+1 ) generated as in Algorithm^ for 2 = 0,1,... with 
the parameters r and a satisfying ra = . We have 

( _p_ (p-<i)y \... _ m + (^_ + (" - _ „ (t)|| 2 


\2qr 
9_ 
n 


9 + 2 -— ) IIy* -r 

Zma mn ' 


d\\x^ — X^ ^|| 2 


> 


+_^W-,M)),j / W_ !/ *) + ^ 

fj^ + pX\ Et || x (t+1) _ s *||2 + (JL- + E t \\y^ - y 

\Zqr q J \Zma m/ 


* ||2 


x«|| 2 + 


+—E t (A T (y {t+1) -y*),x (m) - x®) . 
nq \ / 


n 6nq n — m 


2 mcr 4amp 4am 


^t\\y {t+1 ) -y {t) \ 


(51) 


Proof. Let x^\ and y( t+1 ) generated as in Algorithm [I] By the second conclusion of Lemma 

[I] and the tower property EtE t+ = E t , we have 

7 + M)A) ^-^n 2 > ^ + ^V|| a (tfi)_ x Y + ^E t ||s( t +i)-s( t )|| 2 

\2qr q J \2qr q J 2qr 

+ — E* (A T (y < ' t+1 ' > — y*),x® + -( x ^ +1 - ) — x ^) — x*\ . 
n \ q / 
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Similarly, let i / t ' > . y( t+1 ) and generated as in Algorithm [l| By the second conclusion of Lemma 
[2j we have 




2 ma 


mn 


i^ + T ) E«ll!/ ( ‘ +1) - y'f- + 

\Zma ml Anna 

-~E t (a(x w - x*),y (i) + — (y (t+1) - y (t) ) - y*\ . 
n \ m / 


Summing up these two inequalities, we have 


/ _P_ (p~ g)A 

\ 2 qr q 


lx* - x (t) || 2 + 


n (n — 771)7 


2 ma 


mn 


r-7>ii 2 


> 


( A + E.ll* ( ‘ +1) - **l| 2 + + 2 ) E,||7 +1) - »*l| 2 + A e <II i<1+1) - z (,, ll 2 

\2qr q J \2ma mJ 2qr 

+7 ^-E t \\y( t+1 ) - yW|| 2 + -E* (V(y( t+1 ) - y*),x^ + ^(x^ +1 ) - ®«) - x*\ (52) 

2 m cr n \ q / 

--E* (A(xW - x*), yW + -(y (t+1) - y w ) - y*\ . 

By the definition of y( t+1 ' 1 in Algorithm [lj we have y^ t+1 '> — y* = y( t+x ) — y* + n—Ol(y( t + 1 ) — yM) ; 
which implies 

- /a t (^ +1 ) - y*),x (t) + -(x (f+1) - x w ) - x*\ 

n \ q / 

= - (A T (y ( ~ t+1 ) - y* + U ~ m ( y ( t+1 ) - yW)),x (t) + -(x (m) - x (t) ) - x*\ 

n \ m q / 

^A T (y(* +1 ) - y*), x (m) - x (t) ^) + * ^A T (y (m) - y*), x (f) - x*^} (53) 


P^ 

nq 


+ {n-m)p / A T {y (t+ 1) _ y (t))) _ x (t)\ + - y«)),x^ - x*\ 

nmo \ / nm \ / 


nmq 


Similarly, by the definition of x^ in Algorithm [lj we have x^ — x* = x^ — x* + 0(x^ — x^ 1 ^), 
which implies 

- (A(x (t) - x*),yW + -(y (m) - y w ) - y*) 
n \ m / 

= - /A(x w - x* + 0(x (t) - x^)), yW + -(y (t+1) - y w ) - y*\ 

n \ ml 

= — (a(x^ - x*), y (i+1 ) - y^\ + - (a(x® - x*), y (f ) - y*\ 
m \ / n \ / 


+ - (A(xW - x^ 1 )), y( m > - yW) + - (A( 
m \ / n \ 


M _ _(t-i 


x '" 2 - x v ~ 1 ) ),y 2 " 2 - y 


,(t) 


(54) 


According to (53) and (54), the last two terms in the right hand side of (53) within conditional 
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expectation E t can be represented as 

— / A T (y l ' t+1 ' ) — y*), x ^ + -(x < - t+1 ' ) — x^) — x*\ 

n \ q / 

-- (A (x (t) - x*),y (t) + — (y (t+1) - y {t) ) - y*) 
n \ m / 

^ T (y( t+1 ) - y*), x (m) - x (t) ^) + - <^A T (y (m) - y*), x (t) - x*^ 


n 

P_ 

nq 


+ (n ~ m)P - yW)),x( t+ V - xW) + U T (y (t+1) - yW)),s(*> - ®*\ 

nma \ / nm \ / 


nmq 


— — Ia{x® — x*), — — Ia(x^ — x*), y^ — y*\ 

- - (i(x« - s (t - 1 ) ),y (t+1) - y«\ ~ ~ (a(x® - xl*-V),y® - y*) 

= — (A T (y < ' t+1) -y*),x {t+1) - x (t) ) + (A T (y {t+1) - y®)), x (t+1) - 

ng \ / nma \ 

m \ 


(*+l) _ y(t) 


nmq 

— — (a(x^ — x^),y^ — y*) . 


(55) 


In the next, we establish some lower bounds for each of the four terms in (55). 

Note that x^ — x^ 1 ' 1 is a sparse vector with non-zero values only in the coordinates indexed 
by J. Hence, by Young’s inequality, we have 


«)\ > — — ||(v4 J ) T ( j/( t+ i) — y^)\\ 2 - 

\ / m 


\x 


(t) _ (t-1)112 


m 

4 fjp 


4 r/m 
| X W _ a;(*—!) || 2 
4 r/?n 

\\ x (t) __ x (t-i)||2 
At/ m 


(56) 


Here, the second inequality is because — yW has non-zero values only in the coordinates 

indexed by I so that, by the definition ([8]) of A, 

|| (H J )V t+1 ) - y«)|| 2 = || (Aj) T (y? +1) - yf)|| 2 < A||y? +1) - yff = A|| y^ - y«|| 2 , 

and the last equality is because to = 

A similar argument implies 

(^V‘+‘>-„(*,),*<«> -*<<>) > (57) 


At jm 
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Applying (56) and (57) to the right hand side of (55), we have 

— (A T (y( t+1 ^ — y*),x^ + — x^) — x*\ 

n \ q / 

-- ( A(x (t) - x*),y {t) + — {y {t+1) - y {t) ) - y*) 

> A <A T (y (m) - y*), x (t+1) - x^ ^ ^ A(x (i) - x (t ~ 1] ), y {t) - y 


( 58 ) 


(+ n ~ m \ || w (i+l) _ w (t)||2 _ 
\4amp 4am ) 


6\\x^ 


(t) _ x (i-l)||2 


(n — m)p\\x 


(t+1 ) _ x (t) ||2 


4 r 


4 rnq 


\ 4amp 4am 

The conclusion ([51]) is obtained by combining ( [53] ) and the conditional expectation of (58). □ 

Based on Proposition [lj we can prove Theorem [I] 

Theorem QJ We first show how to derive the forms of r and a in (15) and ( |16|) from the last two 
equations in (JldJ). Let Q\ = and Q 2 = The two equations in (14) imply 


Q1Q2 — 


pn 


(np ) 2 A 


4 A - ( V A ’ Qi + l =Q 2 + -. 
4qmX'yra (mq) z nX'y q m 


Solving the values of Q\ and Q 2 from these equations, we obtain 


p 

2qXr 

— Qi — 

l 1 

(- “ 

-s) 

1 + \ 

n 2 

= Q 2 = 

1 1 

( p 

n \ 

1 + \ 

2m^a 

2 1 

u 

m J 


4 (rip) 2 A 
(mq) 2 n\ / y 5 


_ P 
m q 


+ 


4 (rip) 2 A 
(mq) 2 n\-y ’ 


(59) 

(60) 


from which (15) and (16) can be derived. 

To derive the main conclusion of Theorem [l] from Proposition [lj we want to show that the 
following inequalities are satisfied by the choices for 6, r and a in (|14|). 


fj)_^pX^dq > ( p jp-q )A 


\2qr q J p 
n 


\2qr 


+ i)°s > 


n 


2ma mJ p 
6nq n — m 


n (n — m) 7 


2ma 


mn 


2ma 4a mp 4am 
( p (n — m)p\ 6q 
\2qr 4nrq ) p 

In fact, (61) holds since ( [50| implic^ 

P , P 


> 0 , 

> 


e_ 

4 T 


(61) 

(62) 

(63) 

(64) 


1 

(n p\ 

, 1 

n p 

x 

- 1 — 

+ O 

— — _ 

2 

\m q y 

2 

m q 


+ 


n 


pV A 


mq\Jn \7 


, P n . 

= max <-, — > + 

qm 


n 


,p\Z A 


q\pnX r y 


mi 


8 Here and when we show ( 621 , we use the simple fact that \/ a 2 + b 2 < a + b when a > 0 and b > 0 . 
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so that 


(_P_, (p~ g)A \ , ( p pX\ _1 

V2 qr q ) ' \2qr q) 


1 


p < 1 - - 

2q\r + q max <J £, £ J- + 


■ / 2 n't 
\ q ’ ™ / 


Similarly, (62) holds since (60) implies 

n 2 n 1 / n p\ 1 

2 mya m ~ 2 \m qj 2 


n p 

m q 


+ 


n 


,p\J A 


so that 

n (n — 777)7 


2mcr 


+ 


mn 


/ 


n 


7 


2mcr m 


+ - = 1 - 


mqy/nX 'y 

1 


p n 


= max < -, — > + 


n r p\TR. 
mqy/nX'y 

npy/A 


0q 

p ' 


q m J mqyJnX 7 


< 1 - 


2m r ya m 


max { -, — 1 
I j’ml 


rapyA 

mqy/nX'y 


Oq 

p ' 


The inequality (g holds because ^7 “ 3^ -T^^ 2 ^- 4 ^- 4 ^=°> where we use 

the fact that ^ < 1. The inequality (64) holds because - (n 4 ~^ )p ) ^ ^ ) - 

a r j_ i_\ — JL 

U \2t At) ~ At ‘ 

Applying the four inequalities (61), (62), (63) and (64) to the coefficients of (51) from Proposi¬ 
tion jlj leads to for any t > 0, where 


A<‘> = f A + —) II 1 * - l(,) n 2 + (— + -) Hit* - y m f 

\2qr q J 11 " \2ma m) " " 

+ iL u (x m _*«-!>), „«>-yA + fJL- (57A?) 

gn \ / \2 (/t dnrg / 


,(0 _ _(t—l)i|2 


(65) 


Applying this result recursively gives EA® < (—'j A^ 0 ) where 


AW = + 

\2qr q J 


n 


2 m,a m 


+ Y Ily*-j/ ( 0 ) ll 2 

171 / 


because = (x^ -1 ), t/ -1 )). 

Let / be a uniformly random subset of {1, 2,..., n} with |/| = m, i.e., each index in {1, 2,..., n} 
is contained in I with a probability of ™. By Jensen’s inequality and (|8j), we have 

%\\(A J ) T (y m - !/*)|| 2 = mA)) T (vf - y|)l | 2 < E||( Ai) T (yf - „J )|| 2 = ^|| 9 » - i/*|| 2 , 


where the expectation E is taken over I. This result further implies 


nA, 


\\(A J ) T ( y M-y*)\\ 2 <— \\y {t) -y 


★ || 2 


m 


( 66 ) 


Note that x^ — is a sparse vector with non-zero values only in the coordinates indexed by 

J. Hence, by Young’s inequality, we have 

(A(xW-x^),yW-y*) > -^\\(A J ) T (y^ - y^" 2 ^ 1} " 2 


x v 


n 

tA , 

771 


> - — Il 2 / (t) - 2 / 


.* m2 


- W -*» 2 


4t/t7 

_ x (t-l)||2 
4r/n 

| x w _ xh-Ln 2 
4r/n 
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where the second inequality is because of (66) and the last equality is because to = ^2. Apply¬ 
ing (67) to the right hand side of (65) leadsTo 


A<‘> > (jL + ?2 


. + — I \\x* - x (t) || 2 + 

\2qr q 


+ 


> 


(jp_ _ ( n - m )p _ JL\ |u(t) _ x (t- 1) 

\2 qr inrq 4 rqj 


2m(j m 4<r / 
2 


+ 


4?n<T m 


1 \ ll,*-„<‘>f, 


where the second inequalities holds because 


> 

2 ma 4 a — 2 ma 4ma 4ma 


and > 


P 

2qr 


Irq ~ Arq = 0- Then, the conclusion of Theorem |TJ can be obtained as 


2 qr 4 nrq 4 rq — 


( P M 
V 2 ^ r Q 


E||x* - x (t) || 2 + + —) E\\y* - y {t) || 2 < EA W < ( — ^ A (0) 

V 4 mo m / \P J 


< 1 - 


max 


fp n ) | p-Ry/h J [\2qr ' q 

\q : m j qy/mX"f 


( P | pA 


lx* — X 


(0, ll 2 +(A + r ) I?/*—y (0, ll 2 

\2mo m2 


□ 


A.3 Convergence of objective gap 

To establish the convergence of primal-dual gap (Theorem [ 2 ]), we define the following two functions 


Note that 


P{x) = g(x) + ~{y*) T Ax - 
n 


1 


g{x*) + ~{y*) T Ax* 


1 


D(y) = -y 1 Ax* - - V <t>*{yi) 
n n z — J 

i= 1 


n 


1 1 . 

-(y*f Ac*--£>?(»?) 

n n ' 

i —1 


-P(x) > ^||x - x*|| 2 and Z%) < -^||y - y*|| 2 


( 68 ) 

(69) 

(70) 


and 


P(x) - D(y) < P(x) - D(y) (71) 

for any x G M p and any y G M n because of Q and the strong convexity of g(x) and ^ </>* (?/*)• 

Then, we provide the following proposition, which is the key to prove Theorem [2} 

Proposition 2. Let x^\ x^ t+1 \ j/0) anc [ yO+T generated, as in Algorithm^ for t = 0,1,... with 
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the parameters r and a satisfying ra = . We have 

(S- + \ II*. _ x mp + (^- + ("~ m M n„. _ „»)i|2 


2mv + 2 mn I y V 


> 


\2 qr 2 q 

(A {x m - x «~%yM - y *\ + g|k |,| - ^ (, - 1) " 3 + ^ ?(*«>) _ !i^D(„W) 

n \ / 4 t q ~~ 


m 


. 112 


\2qr 


+ (f-~ ( ^f^) E,||*<‘+‘> - x(‘)f + (E.II/+1) - „(‘)f 
\2gT 4nrg / \2ma Aamp 4am J 

+—E t (A T {y {t+l) - y*), x (m) - x (t) \ + -E t P(x^ t+1) ) - —E t L>(y (m) ). (72) 

nq \ / q m 

Proof. Let x^\ x^ t+1 ^ and y^ +1 ) generated as in Algorithm [l] By the first conclusion of Lemma [l] 
and the tower property !EtIEf_|_ = Et, for any x € M p , 

+ ‘ T + ^'VsM) (T3) 

> ^ ^—f E||x^ +1 ^ — x\\ 2 + P—E\\x ( ' t+1 ' 1 — x ^|| 2 + -E ^ 5 (x ( ' t+1 ^) — g(x)j 

+—E / A T y( t+1 \ x^ + ~(x (t+1) — x — x\ . 

n \ q / 

Let y^\ y( t+1 ) a nd x^ generated as in Algorithm [I] By the first conclusion of Lemma [ 2 J we have, 
for any y £ M n , 

x 7 i=1 

1 71 

£ (sb + £) E » y(,+,> - y » 2 + ^ E '' y(,+,> - y(,| » 2 + mS E («fe ( ‘ +1) )-«<*)) 

i —1 

--E (Ax®,y® + -(;y (m) - y {t) ) - y) 

Summing up the inequalities ( [73] ) and ( |74| > and setting (. x,y ) = (x*,y*) yield 
( V- + (P-i) y \ II*. _ *C>||2 + fun 4. (" - m >T ) _ „( t )|,2 


\2qr 2 q 

p-q 


2rna + 2 mn ^ ^ y 


+ 


(W t} ) - g(x*j) + ^ £ (tf (y? } ) - (y?)) 


(75) 


Z=1 


> 


( V , gA 
V2gT 


+E.HXC+ 1 ) - x*n 2 +A e >ii* |,+1) - * (,) ii 2 + (if- +A) e «ii7‘ +1) - »*ii 2 

2q J 2qr \2ma 2m J 

n 

+ ~Et (g(x^) - g(x*)) + -^E f (tf (y? +1) ) - </>*(y*)) + |y (m) - y (t) || 2 

g m i=1 mcr 

+ -E# (A T y^ + 1 \x^ + -(x (m) - x (t) ) - ®*\ - -Et (Ax®,y® + ~{y {t+1) - y (t) ) - y*\ . 
n \ q / n \ m / 
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By the definitions of P(x^), D(y 0)), P(x^ t+1 ^), and D(y^ t+1 '>), (75) is equivalent to 


(-X- + || T . - „«>||2 4. (^- 4 - ll„* - „W ||2 


\2qr ' 2q 

+ P -^P( X U) + n - m 


x — ar 
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2mcr 


+ 


2 mn 


II y*-y ( 
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D(y {t) ) 


> 


(/-+£) E,||*<‘ +1 > - x*|| 2 + -^E t ||x<‘+» - V‘>|| 2 + (A + A E ( |7‘ +1) - VII 2 

\2qr 2q J 2qr \2ma 2m/ 

+-E t P(x^) + -E t D(y^) - -Et U(x® - x*),y® + ~(y (t+1) - y (t) ) - y*) (76) 

q m n \ m / 

+-E t /A T (y^ - y*),x® + -( x^ - x®) - x*\ + —E*||y (m) - y (t) || 2 . 
re \ q / 2mcr 


x * — x^|| 2 + 


re (re — 771)7 


2m a 


2mn 




which, together with (58), implies 

+ 

- -P{x (t) ) - ^D(yW) 

q m 

( 2 r~ + tt) E *N (t+1) - ®*H 2 + ^Ei||x (t+1) - xW || 2 + -E t P{x {t+1) ) - — E t D(y {t+1) ) 

\2 qr 2 qj 2 qr q m 


f V (P ~ g)A 

\2qr 2 q 

p — q (+\ s n — m 


> 


+ — (A 1 - (t/ 4+1 ) — y*),x^ t+1 ^ — — — (^A(x^ — x^ -1 -*), — y*^ 


( Onq 


n — m 

+ - - I II y 


(t+i) (t) 112 OIW"’ x 


(t) _ „(*-!) ||2 


(77 — ?n)p||x 


(t+i) _ x (t)||2 


\Aamp Aom J 4r 4rreg 

The conclusion of the proposition is obtained by organizing the terms of the inequality above. □ 

Based on Proposition [2j we now can prove Theorem [2] 

Theorem We first show that the following inequalities are satisfied according to the choice for 
8 in (18) and the choices for r and a in (14). 
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\2 qr 2q J p 
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2ma 2m J p 
Onq 71 — m 
2mo Aomp Aom 
( p (re — m)p\ 0q 
\2qr Anrq 
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\2qr 2 q 

77 (re — 777)7 
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2777,77 
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Oq 
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(77) 

(78) 

(79) 

(80) 
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Since t and a still satisfy (14) as in Theorem [TJ 
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Therefore, we have shown that (|77|) and (|78|) are satisfied. The inequality (|79|) holds because 
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Recall that P(x) > 0 and D(y) <0 for any a; E and any y E W 1 . Therefore, applying the six 
inequalities (|77j), ( f78] ) , ( [79] ), (80), (81) and (82) to the coefficients of (72) from Proposition [2] leads 
to E 4 A^ +1 ) for any t > 0 , where 
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Applying this result recursively gives EA® < A®, where 

A(01 - (^ + g )» l(0, -*» 2 + (^ + ^)">' , " , -»*» 2 


+^P(x(°)) - -£(y(°>) 
q m 


30 


(84) 






















































because (x^°\y^) = (x( l \y^ 1 ^). Applying (|67j) to the right hand side of ( |83| ) leads to 
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Combining (84), EA® < A® and (85) together, we obtain 
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In the next, we will establish the relationship between P(x W) — D(y^)) and the actual primal-dual 
objective gap P{x^>) — D(y^>). 
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is a convex and smooth function of x. Moreover, its gradient VT’(x) is Lipschitz continuous with 
a Lipschitz constant of an d VF(x*) = yA T y*. As a result, we have 
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According to the definition of the primal and dual objective functions © and © and their 
relationship with the saddle-point problem ([5]), we have 
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where the inequality is due to (87) and (88) and the last equality is due to ([6]). 

Similarly, because g(x) is a A-strong convex function of x, according to Theorem 1 in Nesterov 
12005 again, the function defined as 
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is a concave and smooth function of y. Moreover, its gradient VZ)(x) is Lipschitz continuous with 


a Lipschitz constant of and \7D(y*) = A Ax *. As a result, we have 
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With a derivation similar to (89), we can show 
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Combining (89) and (92) and using the definitions (68) and (69), we obtain 
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The property (70) of P and D and (85) imply 
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Using this inequality and the definition (84) of A^ 0 ^, we obtain 
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Applying this inequality to the right hand size of (96), we 
the new definition (18) of 9. 
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obtain the conclusion of Theorem [2] by 
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